NSF PAR Search | NSF Public Access Repository

Learning-Augmented Hierarchical Clustering

Braverman, Vladimir; Ergun, Jon; Wang, Chen; Zhou, Samson (July 2025, Proceedings of Machine Learning Research)

Full Text Available
On the Price of Differential Privacy for Hierarchical Clustering

Deng, Chengyuan; Gao, Jie; Upadhyay, Jalaj; Wang, Chen; Zhou, Samson (April 2025, International Conference on Representation Learning 2025 (ICLR 2025))

Full Text Available
On Approximability of 𝓁₂² Min-Sum Clustering

https://doi.org/10.4230/lipics.socg.2025.62

S, Karthik C; Lee, Euiwoong; Rabani, Yuval; Schwiegelshohn, Chris; Zhou, Samson (January 2025, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Aichholzer, Oswin; Wang, Haitao (Ed.)
The 𝓁₂² min-sum k-clustering problem is to partition an input set into clusters C_1,…,C_k to minimize ∑_{i=1}^k ∑_{p,q ∈ C_i} ‖p-q‖₂². Although 𝓁₂² min-sum k-clustering is NP-hard, it is not known whether it is NP-hard to approximate 𝓁₂² min-sum k-clustering beyond a certain factor. In this paper, we give the first hardness-of-approximation result for the 𝓁₂² min-sum k-clustering problem. We show that it is NP-hard to approximate the objective to a factor better than 1.056 and moreover, assuming a balanced variant of the Johnson Coverage Hypothesis, it is NP-hard to approximate the objective to a factor better than 1.327. We then complement our hardness result by giving a fast PTAS for 𝓁₂² min-sum k-clustering. Specifically, our algorithm runs in time O(n^{1+o(1)} d ⋅ 2^{(k/ε)^{O(1)}}), which is the first nearly linear time algorithm for this problem. We also consider a learning-augmented setting, where the algorithm has access to an oracle that outputs a label i ∈ [k] for each input point, thereby implicitly partitioning the input dataset into k clusters that induce an approximately optimal solution, up to some amount of adversarial error α ∈ [0,1/2). We give a polynomial-time algorithm that outputs a (1+γα)/(1-α)²-approximation to 𝓁₂² min-sum k-clustering, for a fixed constant γ > 0.
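As a rough illustration of the objective in the abstract above, the following sketch evaluates the 𝓁₂² min-sum cost of a given labeling by brute force. It is not the paper's PTAS or its learning-augmented algorithm. The function names and the toy data are hypothetical, and the sum is assumed to range over ordered pairs (p, q), so each unordered pair is counted twice. The second function uses the standard identity that, for ordered pairs, the within-cluster sum equals 2|C| times the sum of squared distances to the cluster centroid.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def min_sum_cost(points, labels, k):
    """Brute-force min-sum cost: sum over clusters of squared distances
    over all ORDERED pairs (assumption: each unordered pair counts twice)."""
    clusters = [[] for _ in range(k)]
    for point, label in zip(points, labels):
        clusters[label].append(point)
    return sum(sq_dist(p, q) for c in clusters for p in c for q in c)


def centroid_form_cost(points, labels, k):
    """Same cost via the identity sum_{p,q in C} |p-q|^2 = 2|C| sum_p |p-mu|^2."""
    total = 0.0
    for i in range(k):
        cluster = [p for p, label in zip(points, labels) if label == i]
        if not cluster:
            continue
        dim = len(cluster[0])
        mu = [sum(p[j] for p in cluster) / len(cluster) for j in range(dim)]
        total += 2 * len(cluster) * sum(sq_dist(p, mu) for p in cluster)
    return total


# Hypothetical toy instance: four points in the plane, two clusters.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]
print(min_sum_cost(points, labels, 2))        # 40.0
print(centroid_form_cost(points, labels, 2))  # 40.0
```

The brute-force version takes quadratic time per cluster, while the centroid form is linear in the number of points, which is one reason the centroid view is convenient when experimenting with this objective.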
Full Text Available
A Strong Separation for Adversarially Robust ℓ₀ Estimation for Linear Sketches

https://doi.org/10.1109/FOCS61266.2024.00136

Gribelyuk, Elena; Lin, Honghao; Woodruff, David P; Yu, Huacheng; Zhou, Samson (October 2024, IEEE)

Full Text Available
Private Vector Mean Estimation in the Shuffle Model: Optimal Rates Require Many Messages

Asi, Hilal; Feldman, Vitaly; Nelson, Jelani; Nguyen, Huy; Talwar, Kunal; Zhou, Samson (September 2024, OpenReview.net)

Full Text Available
Streaming Algorithms with Few State Changes

https://doi.org/10.1145/3651145

Jayaram, Rajesh; Woodruff, David P; Zhou, Samson (May 2024, Proceedings of the ACM on Management of Data)

In this paper, we study streaming algorithms that minimize the number of changes made to their internal state (i.e., memory contents). While the design of streaming algorithms typically focuses on minimizing space and update time, these metrics fail to capture the asymmetric costs, inherent in modern hardware and database systems, of reading versus writing to memory. In fact, most streaming algorithms write to their memory on every update, which is undesirable when writing is significantly more expensive than reading. This raises the question of whether streaming algorithms with small space and number of memory writes are possible. We first demonstrate that, for the fundamental F_p moment estimation problem with p ≥ 1, any streaming algorithm that achieves a constant factor approximation must make Ω(n^{1-1/p}) internal state changes, regardless of how much space it uses. Perhaps surprisingly, we show that this lower bound can be matched by an algorithm which also has near-optimal space complexity. Specifically, we give a (1+ε)-approximation algorithm for F_p moment estimation that uses a near-optimal Õ_ε(n^{1-1/p}) number of state changes, while simultaneously achieving near-optimal space, i.e., for p∈[1,2), our algorithm uses poly(log n,1/ε) bits of space, while for p>2, the algorithm uses Õ_ε(n^{1-1/p}) space. We similarly design streaming algorithms that are simultaneously near-optimal in both space complexity and the number of state changes for the heavy-hitters problem, sparse support recovery, and entropy estimation. Our results demonstrate that an optimal number of state changes can be achieved without sacrificing space complexity.
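To make the quantities in the abstract above concrete, here is a minimal sketch that computes the exact F_p moment of a stream (taken here as the sum of p-th powers of item frequencies, the usual definition) and counts how often a naive exact counter writes to its state. It is a baseline for intuition only, not the paper's algorithm. The function names and the toy stream are hypothetical.

```python
from collections import Counter


def fp_moment(stream, p):
    """Exact F_p moment: sum of f_i ** p over item frequencies f_i."""
    freqs = Counter(stream)
    return sum(f ** p for f in freqs.values())


def naive_state_changes(stream):
    """Count memory writes made by a naive exact frequency counter,
    which updates its table on every stream element."""
    counts = {}
    changes = 0
    for item in stream:
        counts[item] = counts.get(item, 0) + 1
        changes += 1  # every update modifies the stored state
    return changes


# Hypothetical toy stream of length n = 6.
stream = ["a", "b", "a", "c", "a", "b"]
n, p = len(stream), 2

print(fp_moment(stream, p))         # 14  (3**2 + 2**2 + 1**2)
print(naive_state_changes(stream))  # 6   (one write per update)
print(n ** (1 - 1 / p))             # scale of the lower bound, constants ignored
```

The naive counter changes state n times, whereas the bound stated in the abstract scales as n^{1-1/p}, which is the gap the paper's algorithms are described as closing.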
Full Text Available
Private Vector Mean Estimation in the Shuffle Model: Optimal Rates Require Many Messages

Asi, Hilal; Feldman, Vitaly; Nelson, Jelani; Nguyen, Huy; Talwar, Kunal; Zhou, Samson (July 2024, Proceedings of the 41st International Conference on Machine Learning (ICML))

Full Text Available
Private Vector Mean Estimation in the Shuffle Model: Optimal Rates Require Many Messages

Asi, Hilal; Feldman, Vitaly; Nelson, Jelani; Nguyen, Huy L; Talwar, Kunal; Zhou, Samson (July 2024, Proceedings of Machine Learning Research)

We study the problem of private vector mean estimation in the shuffle model of privacy where n users each have a unit vector v^{(i)} in R^d. We propose a new multi-message protocol that achieves the optimal error using O~(min(n*epsilon^2, d)) messages per user. Moreover, we show that any (unbiased) protocol that achieves optimal error requires each user to send Omega(min(n*epsilon^2,d)/log(n)) messages, demonstrating the optimality of our message complexity up to logarithmic factors. Additionally, we study the single-message setting and design a protocol that achieves mean squared error O(dn^{d/(d+2)} * epsilon^{-4/(d+2)}). Moreover, we show that any single-message protocol must incur mean squared error Omega(dn^{d/(d+2)}), showing that our protocol is optimal in the standard setting where epsilon = Theta(1). Finally, we study robustness to malicious users and show that malicious users can incur large additive error with a single shuffler.
Full Text Available
Bandwidth-Hard Functions: Reductions and Lower Bounds

https://doi.org/10.1007/s00145-024-09497-3

Blocki, Jeremiah; Liu, Peiyuan; Ren, Ling; Zhou, Samson (April 2024, Journal of Cryptology)

Memory Hard Functions (MHFs) have been proposed as an answer to the growing inequality between the computational speed of general purpose CPUs and ASICs. MHFs have seen widespread applications including password hashing, key stretching and proofs of work. Several metrics have been proposed to quantify the memory hardness of a function. Cumulative memory complexity (CMC) quantifies the cost to acquire/build the hardware to evaluate the function repeatedly at a given rate. By contrast, bandwidth hardness quantifies the energy costs of evaluating this function. Ideally, a good MHF would be both bandwidth hard and have high CMC. While the CMC of leading MHF candidates is well understood, little is known about the bandwidth hardness of many prominent MHF candidates. Our contributions are as follows: First, we provide the first reduction proving that, in the parallel random oracle model (pROM), the bandwidth hardness of a data-independent MHF (iMHF) is described by the red-blue pebbling cost of the directed acyclic graph associated with that iMHF. Second, we show that the goals of designing an MHF with high CMC/bandwidth hardness are well aligned. Any function (data-independent or not) with high CMC also has relatively high bandwidth costs. Third, we prove that in the pROM the prominent iMHF candidates such as Argon2i, aATSample and DRSample are maximally bandwidth hard. Fourth, we prove the first unconditional tight lower bound on the bandwidth hardness of a prominent data-dependent MHF called Scrypt in the pROM. Finally, we show the problem of finding the minimum cost red–blue pebbling of a directed acyclic graph is NP-hard.
Full Text Available
Differentially Private L2-Heavy Hitters in the Sliding Window Model

Blocki, Jeremiah; Lee, Seunghoon; Mukherjee, Tamalika; Zhou, Samson (February 2023, Eleventh International Conference on Learning Representations (ICLR 2023))

The data management of large companies often prioritizes more recent data, as a source of higher accuracy prediction than outdated data. For example, the Facebook data policy retains user search histories for months while the Google data retention policy states that browser information may be stored for up to months. These policies are captured by the sliding window model, in which only the most recent statistics form the underlying dataset. In this paper, we consider the problem of privately releasing the L2-heavy hitters in the sliding window model, which includes Lp-heavy hitters for p<=2 and in some sense gives the strongest possible guarantees that can be achieved using polylogarithmic space, but cannot be handled by existing techniques due to the sub-additivity of the L2 norm. Moreover, existing non-private sliding window algorithms use the smooth histogram framework, which has high sensitivity. To overcome these barriers, we introduce the first differentially private algorithm for L2-heavy hitters in the sliding window model by initiating a number of L2-heavy hitter algorithms across the stream with significantly lower threshold. Similarly, we augment the algorithms with an approximate frequency tracking algorithm with significantly higher accuracy. We then use smooth sensitivity and statistical distance arguments to show that we can add noise proportional to an estimation of the norm. To the best of our knowledge, our techniques are the first to privately release statistics that are related to a sub-additive function in the sliding window model, and may be of independent interest to future differentially private algorithmic design in the sliding window model.
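For readers unfamiliar with the problem in the abstract above, the following sketch computes L2-heavy hitters exactly over the most recent items of a stream. It is a non-private, linear-space baseline and not the paper's differentially private algorithm. It assumes the common definition that item i is an ε-heavy hitter when its frequency is at least ε times the L2 norm of the window's frequency vector; the function name, parameters, and toy stream are hypothetical.

```python
import math
from collections import Counter, deque


def l2_heavy_hitters_window(stream, window, eps):
    """Exact (non-private) L2-heavy hitters among the last `window` items.

    Assumed definition: item i is reported when f_i >= eps * ||f||_2,
    where f is the frequency vector of the current window.
    """
    recent = deque(stream, maxlen=window)  # keeps only the last `window` items
    freqs = Counter(recent)
    l2_norm = math.sqrt(sum(f * f for f in freqs.values()))
    return {item for item, f in freqs.items() if f >= eps * l2_norm}


# Hypothetical toy stream: "x" dominates early, "a" dominates recently.
stream = ["x"] * 5 + ["a", "b", "a", "a", "c"]

print(l2_heavy_hitters_window(stream, window=5, eps=0.5))   # {'a'}
print(l2_heavy_hitters_window(stream, window=10, eps=0.6))  # {'x'}
```

The two calls show how the answer depends on the window: with only the five most recent items, the early burst of "x" has expired and "a" is the sole heavy hitter.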
Full Text Available
